NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Testing high-dimensional multinomials with applications to text analysis

https://doi.org/10.1093/jrsssb/qkae003

Cai, T Tony; Ke, Zheng T; Turner, Paxton (February 2024, Journal of the Royal Statistical Society Series B: Statistical Methodology)

Abstract Motivated by applications in text mining and discrete distribution inference, we test for equality of probability mass functions of K groups of high-dimensional multinomial distributions. Special cases of this problem include global testing for topic models, two-sample testing in authorship attribution, and closeness testing for discrete distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null hypothesis, is proposed. This parameter-free limiting null distribution holds true without requiring identical multinomial parameters within each group or equal group sizes. The optimal detection boundary for this testing problem is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyse two real-world datasets to examine, respectively, variation among customer reviews of Amazon movies and the diversity of statistical paper abstracts.
more » « less
Full Text Available
Near-optimal fitting of ellipsoids to random points

Potechin, Aaron; Turner, Paxton M.; Venkat, Prayaag; Wein, Alexander S. (July 2023, Proceedings of Machine Learning Research)

Given independent standard Gaussian points v1, . . . , vn in dimension d, for what values of (n, d) does there exist with high probability an origin-symmetric ellipsoid that simultaneously passes through all of the points? This basic problem of fitting an ellipsoid to random points has connections to low-rank matrix decompositions, independent component analysis, and principal component analysis. Based on strong numerical evidence, Saunderson, Parrilo, and Willsky (Saunderson, 2011; Saunderson et al., 2013) conjectured that the ellipsoid fitting problem transitions from feasible to infeasible as the number of points n increases, with a sharp threshold at n ∼ d^2/4. We resolve this conjecture up to logarithmic factors by constructing a fitting ellipsoid for some n = d^2/polylog(d). Our proof demonstrates feasibility of the least squares construction of (Saunderson, 2011; Saunderson et al., 2013) using a convenient decomposition of a certain non-standard random matrix and a careful analysis of its Neumann expansion via the theory of graph matrices.
more » « less
Full Text Available
Phase transition for detecting a small community in a large network

Jin, Jiashun; Ke, Tracy Zheng; Turner, Paxton; Zhang, Anru R. (February 2023, The Eleventh International Conference on Learning and Representations)

Full Text Available
Phase transition for detecting a small community in a large network

Jin, Jiashun; Ke, Zheng Tracy; Turner, Paxton; Zhang, Anru (January 2023, The Eleventh International Conference on Learning and Representations)

Full Text Available
Gaussian discrepancy: A probabilistic relaxation of vector balancing

https://doi.org/10.1016/j.dam.2022.08.007

Chewi, Sinho; Gerber, Patrik; Rigollet, Philippe; Turner, Paxton (December 2022, Discrete Applied Mathematics)

Full Text Available
Efficient Interpolation of Density Estimators

Turner, Paxton and (January 2021, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics)

Full Text Available
A Statistical Perspective on Coreset Density Estimation

Turner, Paxton and (January 2021, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics)

Full Text Available
Efficient Interpolation of Density Estimators

Turner, Paxton; Liu, Jingbo; Rigollet, Philippe (January 2021, Proceedings of Machine Learning Research)
Banerjee, Arindam; Fukumizu, Kenji (Ed.)
Full Text Available
A Statistical Perspective on Coreset Density Estimation

Turner, Paxton; Liu, Jingbo; Rigollet, Philippe (January 2021, Proceedings of Machine Learning Research)
null (Ed.)
Full Text Available
Balancing Gaussian vectors in high dimension

Turner, Paxton; Meka, Raghu; Rigollet, Philippe (January 2020, Proceedings of Machine Learning Research)
Abernethy, Jacob; Agarwal, Agarwal (Ed.)
Motivated by problems in controlled experiments, we study the discrepancy of random matrices with continuous entries where the number of columns $$n$$ is much larger than the number of rows $$m$$. Our first result shows that if $$\omega(1) = m = o(n)$$, a matrix with i.i.d. standard Gaussian entries has discrepancy $$\Theta(\sqrt{n} \, 2^{-n/m})$$ with high probability. This provides sharp guarantees for Gaussian discrepancy in a regime that had not been considered before in the existing literature. Our results also apply to a more general family of random matrices with continuous i.i.d. entries, assuming that $$m = O(n/\log{n})$$. The proof is non-constructive and is an application of the second moment method. Our second result is algorithmic and applies to random matrices whose entries are i.i.d. and have a Lipschitz density. We present a randomized polynomial-time algorithm that achieves discrepancy $$e^{-\Omega(\log^2(n)/m)}$$ with high probability, provided that $$m = O(\sqrt{\log{n}})$$. In the one-dimensional case, this matches the best known algorithmic guarantees due to Karmarkar–Karp. For higher dimensions $$2 \leq m = O(\sqrt{\log{n}})$$, this establishes the first efficient algorithm achieving discrepancy smaller than $$O( \sqrt{m} )$$.
more » « less
Full Text Available

Search for: All records